Abstract

Introduction. An analysis of approaches to human body tracking revealed problems in capturing movements
in a three-dimensional coordinate system. Motion capture systems based on computer vision are noted as promising.
Existing studies on markerless motion capture consider positioning only in two-dimensional space.
Therefore, the research objective is to increase the accuracy of determining the coordinates of the human body in
three-dimensional space by developing a motion capture method based on computer vision and triangulation
algorithms.

Materials and Methods. A method of motion capture was presented, including calibration of several cameras and 
formalization of procedures for detecting a person in a frame using a convolutional neural network. Based on the 
skeletal points obtained from the neural network, a three-dimensional reconstruction of the human body model was 
carried out using various triangulation algorithms. 

Results. Experimental studies were carried out comparing four triangulation algorithms: direct linear transformation,
the linear least squares method, L2 triangulation, and the polynomial method. The optimal triangulation algorithm
(polynomial) was determined, providing an error of no more than 2.5 pixels or 1.67 centimeters.

Discussion and Conclusion. The shortcomings of existing motion capture systems were revealed. The proposed
method was aimed at improving the accuracy of motion capture in three-dimensional coordinates using computer
vision. The results obtained were integrated into software for positioning the human body in three-dimensional
coordinates for use in virtual simulators, motion capture systems and remote monitoring.


Keywords: motion capture, virtual reality, triangulation, computer vision, machine learning
Introduction. Significant progress has been made in the field of computer vision. Technologies have
been developed for detecting objects, determining their state, estimating the geometry of the space
depicted in the frame, and much more. As a result, computer vision has become widespread in various spheres of human
activity, ranging from healthcare and education to entertainment. A promising direction is the use of computer
vision technologies for three-dimensional reconstruction and positioning of various objects, including people. There is
a fairly large number of systems for determining the absolute position of a person in space, which can be divided into
the following categories.

— Systems using inertial sensors, which measure the displacement of the sensors and the change of angles between
them by means of gyroscopes and accelerometers [1]. A well-known representative of this category is the Noitom
Mocap Perception Neuron [2], which includes up to 32 inertial sensors.

— Laser position tracking systems based on base stations installed on opposite sides of the room and
emitting infrared rays, which provide accurate determination of the position and orientation of sensors in space. An
example of such systems is the virtual reality kit from HTC [3], which has an error of up to 0.1 mm.

— Systems using magnetic sensors [4], which capture human movement by means of a magnetic field and
require wearable sensors on the user's body. This category includes Polhemus Liberty, a portable
electromagnetic motion tracking system considered one of the fastest (sampling rate of 240 Hz).

— Marker-based optical systems, which determine the position of objects from markers using a set of cameras. An
example is Vicon, which has a fairly low error: the mean absolute marker tracking error is 0.15 mm in static tests and
0.2 mm (with corresponding angular errors of 0.3°) in dynamic tests [5].

— Marker-free optical systems based on computer vision and machine learning. Examples of such
technologies are OpenPose, MediaPipe, and MoveNet. With their help, human movements can be tracked with an
accuracy of up to 30 mm [6].

After analyzing the listed categories of motion capture systems, it can be concluded that most of the solutions used
to recognize human actions and movements involve various wearable devices, such as sensors or gloves. Most of these
devices are bulky due to the large number of sensors and the need for a wired connection. Some systems have high
accuracy, but their use is limited by their size or by electromagnetic interference [7]. Inertial systems
suffer from error accumulation, which limits their use to relative positioning in space.

Therefore, optical systems for recognizing and tracking user actions are well regarded. Information about the
actions and position of the user is obtained from camera frames. Among optical systems, those that use markers
require the user to wear special clothes or attached labels, which makes them difficult
to use under real conditions. They are more applicable to specially prepared premises (e.g., film studios).

Systems that do not use markers allow users to interact more freely with the environment and are more suitable
for use under real conditions. The significant disadvantages of such systems include relatively low accuracy,
unreliability, and low performance. To a great extent, this may be due to the shortcomings of the computer vision
algorithms used to recognize a person in the frame, as well as to the following factors: the variability of a person's
appearance and lighting conditions, partial occlusions caused by overlapping objects in the scene, and the complexity
of the human skeletal structure.

As a rule, the operation of marker-free motion capture systems is based on a human pose estimation algorithm.
Approaches to pose estimation can be divided into top-down and bottom-up.
In top-down approaches, people are first detected in the frame, and then the pose of each detected person is
estimated. Bottom-up algorithms first search for body parts in the frame and then group them into poses.
Convolutional neural networks are typically used for this task, such as YOLO (You Only Look Once) [8],
SSD (Single Shot Detector) [9], R-CNN (Region-based CNN) [10], and others. They recognize
numerous different objects, including a person or individual body parts, with high accuracy. However, one
disadvantage of the solutions listed above is their low speed. To solve this problem, there
are special frameworks (MoveNet [11], MediaPipe [12], OpenPose [13]) that also use neural networks optimized for
real-time operation.

It should be noted that the above algorithms, technologies and approaches of marker-free motion capture
provide positioning in two-dimensional space, which makes it difficult both to determine the distance to objects and
their sizes, and to track complex movements when, e.g., the user's hands are hidden by the body. Existing solutions
based on stereo cameras can be effective, but they are not very accurate when the object is far from
the camera, which is the case when tracking the entire human body. In addition, they do not solve the problem of
occlusions. Thus, the main line of this research is the development of a motion capture method using multiple cameras
and computer vision technologies. When implementing multi-camera motion capture systems, the problem of
combining objects from several images inevitably arises, i.e., the need to perform triangulation. Among the
triangulation methods, linear and iterative linear algorithms can be distinguished.

Linear triangulation is the most common approach to reconstructing objects in three-dimensional
space. It includes such methods as the linear eigen method, the linear least squares method, and the direct linear
transformation, which differ in their resistance to noise [14].

Iterative linear methods are a more robust version of linear triangulation. Conventional linear methods may be less
accurate when triangulating a set of points, since in this case the minimized error has no
geometric meaning (it does not take into account the shape of the skeleton and the rules for connecting points). The key
idea of iterative linear methods is to adaptively change the weights of the linear equations so that the weighted
equations correspond to the errors. Iterative linear methods include L2 and L∞ triangulation [15].

Thus, the following task was set in this study: to develop a method for capturing human
movements that positions the user's body in three-dimensional coordinates with minimal error using
computer vision technologies. The proposed method can be used as a replacement for existing motion capture systems,
or as part of other algorithms, e.g., for the subsequent classification of a person's condition. This work was aimed at
increasing the accuracy of determining the poses and coordinates of the human body in three-dimensional space
by developing motion capture methods based on computer vision. To achieve this goal, it was required to formalize the
main stages of capturing points of the human body from several cameras, to integrate triangulation
algorithms and choose among them the optimal one in terms of accuracy, and to carry out the software
implementation of the proposed method.

Materials and Methods. Solving the problem of 3D positioning of a person in space includes the following main 
stages: 

— preliminary calibration of a set of cameras; 

— implementation of human detection procedures in the frame, and calculation of skeletal points; 

— calculation of 3D reconstruction of the human body model. 

Let us look at them in more detail. 

The calibration process involves the camera system taking several pictures of a calibration template on which it is
easy to identify key points with known relative positions in space. After that, internal and external parameters are
calculated for each camera. Internal parameters are constant for a particular camera, whereas external parameters
depend on the location of the cameras relative to each other [16]. Therefore, this step must be performed before the
first use of the camera system in a given location.

To calculate the coordinates of a point in three-dimensional space, it is necessary to know the coordinates of its
projections on the images and the projective matrices of the cameras [17]. The projective matrix P of a camera can be
represented as a combination of matrix A (containing the internal parameters of the camera), matrix R (rotation), and
the displacement vector T; R and T describe the change of coordinates from the world coordinate system to the
coordinate system of the camera:

$$P = A[R|T] = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}, \tag{1}$$

where (x, y) are the coordinates of the projection of a 3D point on the image in pixels; $(c_x, c_y)$ are the
coordinates of the central point of the camera; $(f_x, f_y)$ are the focal lengths in pixels; $r_{ij}$ and $t_i$ are the
elements of R and T.
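As an illustration of formula (1), the following minimal sketch composes a projection matrix from intrinsic and extrinsic parameters and projects one 3D point. The function names `projection_matrix` and `project` and all numeric values are assumptions chosen for the example; they are not taken from the authors' software.

```python
import numpy as np

def projection_matrix(fx, fy, cx, cy, R, T):
    """Compose P = A[R|T] from intrinsic and extrinsic parameters, as in (1)."""
    A = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    RT = np.hstack([np.asarray(R, dtype=float),
                    np.asarray(T, dtype=float).reshape(3, 1)])
    return A @ RT

def project(P, X):
    """Project a 3D point X (world coordinates) to pixel coordinates (x, y)."""
    xh = P @ np.append(np.asarray(X, dtype=float), 1.0)  # homogeneous image point
    return xh[:2] / xh[2]                                # divide by the scale factor

# Illustrative values only: a camera at the world origin looking along +Z.
P = projection_matrix(800.0, 800.0, 320.0, 240.0, np.eye(3), np.zeros(3))
uv = project(P, [0.1, -0.05, 2.0])
print(uv)
```

With these values the point projects to x = 800 · 0.1 / 2 + 320 = 360 and y = 800 · (−0.05) / 2 + 240 = 220.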


At the second stage, the key (skeletal) points of the human body are obtained from each of the
cameras. To extract skeletal points from the frame, various machine learning technologies can be used,
e.g., MoveNet, MediaPipe, OpenPose, and others [18]. In this study, it is proposed to use the efficient
Pose module from the MediaPipe library. MediaPipe Pose uses machine learning to track a
person's body posture, determine 3D landmarks, and produce a background segmentation mask for the whole body from RGB
video frames. This approach makes it possible to track up to 33 points and provides real-time operation on most modern
devices.


Thus, as part of the second stage, a set of 33 points is formed for each i-th camera:

$$\{x_{ij} = (u_{ij}, v_{ij}) \mid j = 1, 2, \ldots, 33\}, \quad i \in \{1, 2, \ldots, K\}, \tag{2}$$

where $u_{ij}$ is the coordinate of the j-th point on the X axis in the i-th image; $v_{ij}$ is the coordinate of the
j-th point on the Y axis in the i-th image; K is the total number of cameras and images.

At the third stage, the positions of key skeletal points in three-dimensional space are calculated. To obtain
the position of human skeletal points in space, triangulation is performed, i.e., the coordinates of a 3D point are
found from the coordinates of its projections. Triangulation is one of the most important problems in computer vision:
it is a crucial stage in 3D reconstruction and affects the accuracy of the entire result [19].

Epipolar geometry is fundamental for the three-dimensional reconstruction of the object points based on the position 
values of the projections of the points in the images from all cameras. Its main idea is that 3D points in the scene are 
projected onto lines in the image plane of each camera — epipolar lines. These lines correspond to the intersection of 
the image plane and the plane passing through the camera centers and the 3D point. This idea provides a condition for 
finding pairs of corresponding points on two images: if it is known that point x on the plane of the first image 
corresponds to point x' on the plane of another image, then its projection should lie on the corresponding epipolar line. 
According to this condition, the following relation is valid for all corresponding pairs of points $x \leftrightarrow x'$:

$$x'^{T} F x = 0, \tag{3}$$

where F is the fundamental matrix of size 3×3 and rank 2.
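Relation (3) can be checked numerically. The sketch below builds a fundamental matrix from two projection matrices using the standard identity F = [e']× P' P⁺ (with P⁺ the pseudo-inverse of P and e' the epipole in the second image) and verifies the constraint for one synthetic point. The camera parameters and the helper names `skew` and `fundamental_from_projections` are assumptions made for the example, not part of the described software.

```python
import numpy as np

def skew(v):
    """Matrix [v]x such that skew(v) @ w equals the cross product v x w."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_projections(P1, P2):
    """F such that x2^T F x1 = 0 for projections x1 = P1 X and x2 = P2 X."""
    _, _, Vt = np.linalg.svd(P1)
    C1 = Vt[-1]                       # centre of camera 1: P1 @ C1 = 0
    e2 = P2 @ C1                      # epipole in the second image
    F = skew(e2) @ P2 @ np.linalg.pinv(P1)
    return F / np.linalg.norm(F)      # F is defined up to scale

# Two synthetic cameras (illustrative values only).
A = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = 0.3
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
P1 = A @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = A @ np.hstack([R, np.array([[-1.0], [0.0], [0.2]])])
F = fundamental_from_projections(P1, P2)

X = np.array([0.2, -0.1, 3.0, 1.0])   # homogeneous 3D point
x1 = P1 @ X
x1 = x1 / x1[2]
x2 = P2 @ X
x2 = x2 / x2[2]
residual = float(x2 @ F @ x1)         # should vanish up to rounding error
print(residual)
```

For noisy detections the residual is no longer zero, which is the situation the triangulation algorithms discussed later have to handle.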

For a point X given in three-dimensional space, the following projection formula expressed in homogeneous
coordinates is valid:

$$x_i = P_i X, \tag{4}$$

where $x_i = w(u_i, v_i, 1)^T$ are the homogeneous coordinates of a point on the plane of the i-th image (obtained from
the i-th camera during the second stage), including its position $u_i$ (on the X axis) and $v_i$ (on the Y axis);
w is the scale factor; $P_i$ is the projection matrix of the i-th camera obtained at the first stage.


Information Technology, Computer Science and Management

http://vestnik-donstu.ru

Advanced Engineering Research (Rostov-on-Don). 2023;23(3):317-328. eISSN 2687-1653


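To simplify calculations, the projection matrix of the camera is often presented in the following form:

$$P_i = \begin{bmatrix} p_i^{1T} \\ p_i^{2T} \\ p_i^{3T} \end{bmatrix}, \quad P_i \in \mathbb{R}^{3 \times 4}, \tag{5}$$

where $p_i^{jT}$ is the j-th row of matrix $P_i$.

Therefore, equation (4) can be represented as follows:

$$\begin{cases} w u_i = p_i^{1T} X, \\ w v_i = p_i^{2T} X, \\ w = p_i^{3T} X. \end{cases} \tag{6}$$

Since w is a scale factor, substituting the third equation into the first two eliminates it and gives the following
system of equations:

$$\begin{cases} u_i \, p_i^{3T} X - p_i^{1T} X = 0, \\ v_i \, p_i^{3T} X - p_i^{2T} X = 0. \end{cases} \tag{7}$$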


Since X is a homogeneous representation of coordinates in three-dimensional space, its calculation requires
$x_i$ and $P_i$ for at least two cameras. To solve the system of equations (7), four algorithms were
considered [14]:
— direct linear transformation (DLT);
— linear least squares method;
— L2 triangulation;
— optimal (polynomial) method.
DLT is a linear triangulation algorithm, whose main advantage is the simplicity of its implementation.
Specifically, the OpenCV computer vision library provides a ready-made implementation of this algorithm in the
triangulatePoints function.
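For illustration, DLT can also be written in a few lines of NumPy: the two equations of system (7) are stacked for every camera, and the homogeneous solution is taken as the right singular vector of the smallest singular value. This is a minimal sketch under assumed synthetic camera parameters; the function name `triangulate_dlt` is hypothetical, and the sketch is not the OpenCV implementation mentioned above.

```python
import numpy as np

def triangulate_dlt(points_2d, projections):
    """Solve the homogeneous system (7) for X by singular value decomposition."""
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])   # u * p3^T X - p1^T X = 0
        rows.append(v * P[2] - P[1])   # v * p3^T X - p2^T X = 0
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    Xh = Vt[-1]                        # null-space direction of the stacked system
    return Xh[:3] / Xh[3]              # back to inhomogeneous coordinates

def project(P, X):
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]

# Two synthetic cameras (illustrative values only).
A = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = 0.3
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
P1 = A @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = A @ np.hstack([R, np.array([[-1.0], [0.0], [0.2]])])

X_true = np.array([0.2, -0.1, 3.0])
observations = [project(P1, X_true), project(P2, X_true)]
X_est = triangulate_dlt(observations, [P1, P2])
print(X_est)
```

With exact (noise-free) projections the estimate coincides with the original point up to rounding; with noisy detections it minimizes an algebraic rather than a geometric error.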


The linear least squares method is also a linear one: the system of homogeneous
equations (7) is reduced to a system of inhomogeneous equations, which is then solved by the least squares
method.
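A minimal sketch of this reduction, assuming the common choice of fixing the last homogeneous coordinate to 1 (X = (X, Y, Z, 1)^T): the fourth column of each equation in (7) moves to the right-hand side, and the resulting inhomogeneous system is solved with `numpy.linalg.lstsq`. The function name and camera values are illustrative assumptions.

```python
import numpy as np

def triangulate_least_squares(points_2d, projections):
    """Solve (7) with the last homogeneous coordinate fixed to 1."""
    lhs, rhs = [], []
    for (u, v), P in zip(points_2d, projections):
        for coord, row in ((u, P[0]), (v, P[1])):
            eq = coord * P[2] - row    # coefficients of (X, Y, Z, 1)
            lhs.append(eq[:3])         # unknown part
            rhs.append(-eq[3])         # constant term moved to the right-hand side
    X, *_ = np.linalg.lstsq(np.asarray(lhs), np.asarray(rhs), rcond=None)
    return X

def project(P, X):
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]

# Two synthetic cameras (illustrative values only).
A = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = 0.3
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
P1 = A @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = A @ np.hstack([R, np.array([[-1.0], [0.0], [0.2]])])

X_true = np.array([0.2, -0.1, 3.0])
observations = [project(P1, X_true), project(P2, X_true)]
X_ls = triangulate_least_squares(observations, [P1, P2])
print(X_ls)
```

Fixing the last coordinate to 1 excludes points at infinity, which is not a limitation for a person standing in front of the cameras.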


L2 triangulation is an iterative method of three-dimensional reconstruction, whose solution is reduced to minimizing
the reprojection error:

$$\sum_i d(x_i, \hat{x}_i) \to \min, \tag{8}$$

where $x_i$ is the coordinate of the projection of the estimated point in the image; $\hat{x}_i$ is the projection
coordinate calculated from formula (4) for an already determined spatial point; $d(\cdot, \cdot)$ is the distance
between two points.
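One simple iterative scheme in this spirit is the reweighted linear method: each pass solves system (7) with every equation divided by the current depth estimate p^{3T}X, so that the weighted residuals become the image-plane differences between observed and reprojected coordinates. The sketch below is an assumption-based illustration of that idea, not necessarily the iteration used in the study; the function names and camera values are hypothetical.

```python
import numpy as np

def reprojection_error(X, points_2d, projections):
    """Sum over cameras of the distance between observed and reprojected points."""
    Xh = np.append(X, 1.0)
    total = 0.0
    for x, P in zip(points_2d, projections):
        proj = P @ Xh
        total += np.linalg.norm(proj[:2] / proj[2] - x)
    return total

def triangulate_iterative(points_2d, projections, n_iter=10):
    """Iteratively reweighted linear triangulation (weights 1 / (p3^T X))."""
    weights = np.ones(len(projections))
    Xh = None
    for _ in range(n_iter):
        rows = []
        for (u, v), P, w in zip(points_2d, projections, weights):
            rows.append((u * P[2] - P[0]) / w)
            rows.append((v * P[2] - P[1]) / w)
        _, _, Vt = np.linalg.svd(np.asarray(rows))
        Xh = Vt[-1] / Vt[-1][3]                       # normalise so that Xh = (X, Y, Z, 1)
        new_weights = np.array([P[2] @ Xh for P in projections])
        if np.allclose(new_weights, weights, rtol=1e-12, atol=0.0):
            break                                      # weights have converged
        weights = new_weights
    return Xh[:3]

def project(P, X):
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]

# Two synthetic cameras (illustrative values only).
A = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = 0.3
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
P1 = A @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = A @ np.hstack([R, np.array([[-1.0], [0.0], [0.2]])])

X_true = np.array([0.2, -0.1, 3.0])
observations = [project(P1, X_true), project(P2, X_true)]
X_iter = triangulate_iterative(observations, [P1, P2])
error = reprojection_error(X_iter, observations, [P1, P2])
print(X_iter, error)
```

The helper `reprojection_error` evaluates criterion (8) for a given estimate, so the same function can be used to compare the output of different triangulation methods on noisy data.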


The optimal (polynomial) triangulation algorithm is a non-iterative approach. Its solution requires finding the roots
of a sextic (degree-six) polynomial. The minimization criterion for three-dimensional reconstruction in this method can
be defined as follows:

$$\sum_i d(x_i, l_i) \to \min, \tag{9}$$

where $l_i$ is the epipolar line corresponding to point $x_i$.

When using a two-camera system, the following sequence of actions must be performed to minimize error (9):

— parametrize the bundle of epipolar lines in the first image using parameter t, so that an epipolar line in the first
image can be expressed as $l_1(t)$;
— using the fundamental matrix F, calculate the corresponding epipolar line $l_2(t)$ in the second image;
— express the distance function (9) as a function of t;
— search for the value of t at which (9) reaches its minimum.

Using the methods of elementary calculus, the minimization problem can be reduced to
finding the roots of a sextic polynomial. The estimated spatial point is then calculated using the direct linear
transformation (DLT) method [17].
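The steps above can be sketched for two cameras following the textbook formulation of the polynomial method (both image points are moved to the origin, both epipoles are rotated onto the x axis, and the stationary points of the cost are the roots of a degree-six polynomial). This is an illustrative NumPy sketch with squared distances in the cost; the helper names, camera values and pixel offsets are assumptions, and it is not the authors' implementation.

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])

def fundamental_from_projections(P1, P2):
    """F with x2^T F x1 = 0, built from two projection matrices."""
    C1 = np.linalg.svd(P1)[2][-1]                    # centre of camera 1
    F = skew(P2 @ C1) @ P2 @ np.linalg.pinv(P1)
    return F / np.linalg.norm(F)

def triangulate_dlt(points_2d, projections):
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows += [u * P[2] - P[0], v * P[2] - P[1]]
    Xh = np.linalg.svd(np.asarray(rows))[2][-1]
    return Xh[:3] / Xh[3]

def closest_to_origin(line):
    """Homogeneous point on the line (l, m, n) that is closest to the origin."""
    l, m, n = line
    return np.array([-l * n, -m * n, l * l + m * m])

def correct_points(F, x1, x2):
    """Move x1, x2 by the smallest amount so that they satisfy x2^T F x1 = 0."""
    # 1. Translate both points to the origin.
    T1 = np.array([[1.0, 0.0, -x1[0]], [0.0, 1.0, -x1[1]], [0.0, 0.0, 1.0]])
    T2 = np.array([[1.0, 0.0, -x2[0]], [0.0, 1.0, -x2[1]], [0.0, 0.0, 1.0]])
    Fn = np.linalg.inv(T2).T @ F @ np.linalg.inv(T1)
    # 2. Rotate so that the epipoles become (1, 0, f1) and (1, 0, f2).
    U, _, Vt = np.linalg.svd(Fn)
    e1, e2 = Vt[-1], U[:, -1]                        # Fn @ e1 = 0, e2 @ Fn = 0
    e1, e2 = e1 / np.hypot(e1[0], e1[1]), e2 / np.hypot(e2[0], e2[1])
    R1 = np.array([[e1[0], e1[1], 0.0], [-e1[1], e1[0], 0.0], [0.0, 0.0, 1.0]])
    R2 = np.array([[e2[0], e2[1], 0.0], [-e2[1], e2[0], 0.0], [0.0, 0.0, 1.0]])
    Fn = R2 @ Fn @ R1.T
    f1, f2 = e1[2], e2[2]
    a, b, c, d = Fn[1, 1], Fn[1, 2], Fn[2, 1], Fn[2, 2]
    # 3. Stationary points of the cost are the roots of a sextic polynomial g(t).
    atb, ctd = np.array([a, b]), np.array([c, d])
    q = np.polyadd(np.polymul(atb, atb), f2**2 * np.polymul(ctd, ctd))
    w = np.array([f1**2, 0.0, 1.0])
    g = np.polysub(np.polymul([1.0, 0.0], np.polymul(q, q)),
                   (a * d - b * c) * np.polymul(np.polymul(w, w), np.polymul(atb, ctd)))

    def cost(t):
        return (t**2 / (1 + f1**2 * t**2)
                + (c * t + d)**2 / ((a * t + b)**2 + f2**2 * (c * t + d)**2))

    t = min(np.roots(g).real, key=cost)
    l1 = np.array([t * f1, 1.0, -t])
    l2 = np.array([-f2 * (c * t + d), a * t + b, c * t + d])
    # The limit t -> infinity is also a candidate for the minimum.
    if f1 != 0 and 1 / f1**2 + c**2 / (a**2 + f2**2 * c**2) < cost(t):
        l1, l2 = np.array([f1, 0.0, -1.0]), np.array([-f2 * c, a, c])
    # 4. Closest points on the optimal epipolar lines, mapped back to pixels.
    p1 = np.linalg.inv(T1) @ R1.T @ closest_to_origin(l1)
    p2 = np.linalg.inv(T2) @ R2.T @ closest_to_origin(l2)
    return p1[:2] / p1[2], p2[:2] / p2[2]

def project(P, X):
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]

# Two synthetic cameras and a detection perturbed by about one pixel (illustrative).
A = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = 0.3
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
P1 = A @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = A @ np.hstack([R, np.array([[-1.0], [0.0], [0.2]])])
F = fundamental_from_projections(P1, P2)

X_true = np.array([0.2, -0.1, 3.0])
x1_true, x2_true = project(P1, X_true), project(P2, X_true)
x1_obs = x1_true + np.array([0.8, -0.5])
x2_obs = x2_true + np.array([-0.6, 0.7])
x1_hat, x2_hat = correct_points(F, x1_obs, x2_obs)
X_est = triangulate_dlt([x1_hat, x2_hat], [P1, P2])
print(X_est)
```

After the correction the two image points satisfy the epipolar constraint exactly (up to rounding), so the final DLT step has an exact solution and its result does not depend on the algebraic error being minimized.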

Summing up the third stage, after successfully solving system (7) and obtaining the world coordinates of
the key points of the target object (human body), the following set of points H is formed:

$$H = \{X_j \mid \forall i \, (x_{ij} = P_i X_j)\}, \tag{10}$$

where $X_j$ are the world coordinates of the j-th skeletal point of the human body obtained after solving the
triangulation problem, expressed in centimeters.

Thus, in this study, the optimization problem for two cameras is reduced to finding a triangulation method
$MT: \{x_{ij}\} \to H$ for which the reprojection error tends to a minimum:

$$R = \sum_{i} \sum_{j} d(x_{ij}, \hat{x}_{ij}) \to \min, \tag{11}$$

where $\hat{x}_{ij}$ is the projection of point $X_j$ onto the i-th image calculated from formula (4).
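The reprojection error of a whole skeleton can be evaluated in vectorized form. The sketch below assumes that the 2D detections are stored as an array of shape (K, J, 2), the projection matrices as (K, 3, 4), and the triangulated points as (J, 3); this array layout, the function name, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def skeleton_reprojection_error(points_2d, projections, points_3d):
    """Sum of distances between detected and reprojected points over all
    cameras (K) and skeletal points (J)."""
    Xh = np.hstack([points_3d, np.ones((len(points_3d), 1))])  # (J, 4)
    proj = np.einsum('kab,jb->kja', projections, Xh)           # (K, J, 3)
    reproj = proj[..., :2] / proj[..., 2:3]                    # (K, J, 2)
    return float(np.linalg.norm(reproj - points_2d, axis=-1).sum())

# Synthetic data: two cameras and 33 points in front of them (illustrative).
A = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = 0.3
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
projections = np.stack([A @ np.hstack([np.eye(3), np.zeros((3, 1))]),
                        A @ np.hstack([R, np.array([[-1.0], [0.0], [0.2]])])])

rng = np.random.default_rng(0)
points_3d = rng.uniform([-0.5, -1.0, 2.5], [0.5, 1.0, 3.5], size=(33, 3))
Xh = np.hstack([points_3d, np.ones((33, 1))])
proj = np.einsum('kab,jb->kja', projections, Xh)
points_2d = proj[..., :2] / proj[..., 2:3]        # exact detections, shape (2, 33, 2)

exact_error = skeleton_reprojection_error(points_2d, projections, points_3d)

shifted = points_2d.copy()
shifted[0, 0] += np.array([3.0, 4.0])             # move one detection by 5 pixels
shifted_error = skeleton_reprojection_error(shifted, projections, points_3d)
print(exact_error, shifted_error)
```

Dividing the returned sum by K·J gives a mean error per point in pixels, which is convenient for comparing triangulation methods.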


Research Results. Optimization problem (11) is solved by triangulating the 2D object points
obtained from images of several cameras (in this study, from two cameras) using the algorithms
listed in the previous section.

The listed triangulation methods were implemented using OpenCV and NumPy libraries. For comparison, the
algorithms were integrated into software implementing the method of three-dimensional motion capture. An example of
the method for reconstructing the entire human skeleton is shown in Figure 1.


Fig. 1. Example of the method, including recognition of a person on two cameras and construction of a 3D skeleton
(panels: first camera, 3D skeleton, second camera)


Then, these algorithms were compared by the value of the reprojection error function (11) for all points of the
skeleton from two images. The selected triangulation methods were compared both by the error value and by the time
of obtaining a solution (computational complexity) for the entire set of skeleton points. Summary
comparative diagrams are shown in Figure 2.

Fig. 2. Comparison of triangulation methods by metrics: a) by reprojection error; b) by calculation time


A number of experimental tests were also carried out for the selected triangulation methods. During testing, the
calculated lengths of the user's limbs and the absolute deviation of the obtained values from the real ones were
measured for each approach. The comparison is presented in Table 1.


Table 1
Comparison of the accuracy of determining the size of limbs in the process of triangulation

Body segment        DLT          Least squares   L2           Polynomial   Real value
Forearm             25.2 ± 1.6   30.8 ± 0.2      26.6 ± 0.5   24.3 ± 0.4   26
Shin                42.2 ± 2.0   65.3 ± 1.1      44.6 ± 0.7   38.7 ± 1.8   41
Hip                 45.7 ± 2.7   59.5 ± 0.49     48.7 ± 1.3   44.1 ± 0.6   45
Average deviation   2.43         14.58           2.26         1.67         0

Presented are the average values (in centimeters) over a sample of 10 measurements ± standard deviation in the sample
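A limb length of the kind reported in Table 1 can be computed as the Euclidean distance between two triangulated skeletal points. In the sketch below, the mapping of body segments to MediaPipe Pose landmark indices (13 left elbow, 15 left wrist, 23 left hip, 25 left knee, 27 left ankle) is an assumption about how the segments could be defined, and the coordinates are synthetic values chosen so that the lengths equal the real values from the table.

```python
import numpy as np

# Assumed segment definitions as pairs of MediaPipe Pose landmark indices.
SEGMENTS = {
    "forearm": (13, 15),   # left elbow to left wrist
    "shin": (25, 27),      # left knee to left ankle
    "hip": (23, 25),       # left hip to left knee
}

def segment_lengths(points_3d):
    """Length of each segment for one skeleton of shape (33, 3), in input units."""
    return {name: float(np.linalg.norm(points_3d[a] - points_3d[b]))
            for name, (a, b) in SEGMENTS.items()}

def absolute_deviation(measured, reference):
    """Absolute deviation of measured lengths from reference lengths."""
    return {name: abs(measured[name] - reference[name]) for name in reference}

# Synthetic skeleton in centimeters (illustrative values only).
skeleton = np.zeros((33, 3))
skeleton[13] = [0.0, 0.0, 0.0]       # elbow
skeleton[15] = [0.0, 24.0, 10.0]     # wrist
skeleton[23] = [10.0, 45.0, 0.0]     # hip
skeleton[25] = [10.0, 0.0, 0.0]      # knee
skeleton[27] = [10.0, -40.0, 9.0]    # ankle

reference = {"forearm": 26.0, "shin": 41.0, "hip": 45.0}   # real values from Table 1
lengths = segment_lengths(skeleton)
deviation = absolute_deviation(lengths, reference)
print(lengths, deviation)
```

Averaging such lengths over repeated measurements and reporting the standard deviation gives entries in the form used in Table 1.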


The developed software includes the following modules: 


— for working with input devices (cameras); 

— to perform calibration and obtain basic camera parameters; 

— to synchronize multiple cameras; 

— for object recognition (user's body and arms); 

— to analyze the location of the found skeletal points; 

— to build real-time visualization. 

When implementing the software, the Python programming language and the OpenCV and Matplotlib libraries were used.
The system operated in several threads: one was responsible for receiving data from cameras,
the second for visualization, and the third for sending the obtained world coordinates of the human body to external
systems or modules. A unified protocol with a data packet in JSON format allows the software to be integrated into
third-party systems (e.g., game development environments such as Unity and Unreal Engine) [20, 21].
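The structure of the JSON packet is not specified in the text, so the following is only a hypothetical sketch of how the world coordinates of the 33 skeletal points could be serialized for an external consumer. All field names (`timestamp`, `units`, `points`, `id`, `x`, `y`, `z`) are assumptions made for illustration.

```python
import json
import time

def build_packet(points_3d, timestamp=None):
    """Serialize skeletal points (iterable of (x, y, z) in centimeters) to JSON."""
    packet = {
        "timestamp": time.time() if timestamp is None else timestamp,
        "units": "cm",
        "points": [
            {"id": j, "x": float(x), "y": float(y), "z": float(z)}
            for j, (x, y, z) in enumerate(points_3d)
        ],
    }
    return json.dumps(packet)

# Illustrative usage with a dummy skeleton of 33 identical points.
message = build_packet([(0.0, 1.0, 2.0)] * 33, timestamp=0.0)
decoded = json.loads(message)
print(len(decoded["points"]))
```

A text-based packet of this kind can be parsed by any environment with a JSON library, which is the property the integration with third-party systems relies on.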

Discussion and Conclusion. Let us analyze the results of comparing triangulation algorithms by the selected metrics,
shown in Figure 2 and in Table 1.
During the comparison, it was found that the optimal algorithm for three-dimensional reconstruction was the
polynomial method. [...] was no more than 3 %, taking into account the fact that MediaPipe Pose did not detect the
upper point of the head, which was calculated approximately based on the position of the eyes. When measuring limbs,
the error ranged from 0.9 cm to 2.3 cm, with an average of 1.67 cm (Table 1). Thus, real tests confirm the choice of
the polynomial method.

Next, we compared the results obtained with existing studies, e.g., the one described in [22]. Its authors also used
trained networks (OpenPose) to implement a marker-free human recognition system, a camera calibration procedure, and
the extraction of skeletal points, but placed the cameras next to each other to simulate stereo vision. In contrast,
the camera arrangement used in this study made it possible to recognize human postures when some parts of the body
overlapped others. In addition, using MediaPipe Pose made it possible to track 33 skeletal points, not 18 as in the
OpenPose-based method. The obtained error values generally corresponded to existing studies (the best result in [22]
was 2 cm), which allowed us to conclude that the proposed approach can be used in practice. Other marker-free systems,
e.g., based on Kinect [23], also showed comparable results in terms of measurement error (2-5 cm). Thus, the resulting
solution generally corresponded to the accuracy of existing developments.

A comparison of the calculation time for a set of points, shown in Figure 2b, demonstrated that the DLT
algorithm provided the highest performance. However, all algorithms showed acceptable results (sufficient for a speed
of 30 and even 60 frames per second). Therefore, this metric was not decisive.

The developed software can be used in various subject areas, primarily as a replacement for motion capture systems
based on inertial sensors. The advantages of the proposed solution are low implementation costs,
accessibility (transition from highly specialized motion capture suits to common camera-based tools), and the
possibility of capturing body models of several users in parallel [24].

The scientific novelty of the research consists in a comprehensive approach to formalizing the process of
three-dimensional positioning of a person using computer vision technologies. It includes preliminary calibration of a
set of several cameras, formalization of procedures for detecting a person in a frame using an arbitrary neural network
to obtain skeletal points, and calculation of the three-dimensional reconstruction of a human body model using various
triangulation algorithms. The study presents all the necessary calculation formulas and detailed steps to achieve the
goal of increasing the accuracy of determining the poses and coordinates of the human body in three-dimensional
coordinates using computer vision technologies. The theoretical results obtained are quite universal and can be used for
the practical implementation of motion capture systems based on various neural network models, and not just
MediaPipe Pose.